This paper presents generalized probabilistic models for high-order projective dependency
parsing and an algorithmic framework for learning these statistical models involving dependency
trees. Partition functions and marginals for high-order dependency trees can be computed
efficiently by adapting our algorithms, which extend the inside-outside algorithm to higher-order
cases. To show the effectiveness of our algorithms, we perform experiments on three languages
(English, Chinese and Czech), using maximum conditional likelihood estimation for model
training and L-BFGS for parameter estimation. Our methods achieve competitive performance for
English, and outperform all previously reported dependency parsers for Chinese and Czech.

1. Introduction 

Dependency parsing is an approach to syntactic analysis inspired by dependency grammar.
In recent years, several domains of Natural Language Processing have benefited from
dependency representations, such as synonym generation (Shinyama, Sekine, and Sudo 2002),
relation extraction (Nguyen, Moschitti, and Riccardi 2009) and machine translation
(Katz-Brown et al. 2011; Xie, Mi, and Liu 2011). A primary reason for using dependency
structures instead of more informative constituent structures is that they are usually easier to
understand and more amenable to annotators who have good knowledge of the target domain
but lack deep linguistic knowledge (Yamada and Matsumoto 2003), while still containing
much of the information needed in applications.

A dependency structure represents a parse tree as a directed graph with a label
on each edge, and methods based on graph models have been applied to
it and achieved high performance. According to the reports of the CoNLL shared tasks on
dependency parsing (Buchholz and Marsi 2006; Nivre et al. 2007), there are currently two
dominant approaches to data-driven dependency parsing: local-and-greedy transition-based
algorithms (Yamada and Matsumoto 2003; Nivre and Scholz 2004; Attardi 2006;
McDonald and Nivre 2007), and globally optimized graph-based algorithms (Eisner 1996;
McDonald, Crammer, and Pereira 2005; McDonald et al. 2005; McDonald and Pereira 2006;
Carreras 2007; Koo and Collins 2010). Graph-based parsing models have achieved
state-of-the-art accuracy for a wide range of languages.
There are several existing graph-based dependency parsers, most of
which employ online learning algorithms such as the averaged structured perceptron
(AP) (Freund and Schapire 1999; Collins 2002) or the Margin Infused Relaxed
Algorithm (MIRA) (Crammer and Singer 2003; Crammer et al. 2006; McDonald 2006) for
learning parameters. However, one shortcoming of these parsers is that learning the parameters of
these models usually takes a long time (several hours per iteration). The primary reason is
that the training step cannot be performed in parallel: for online learning algorithms, the
update for a new training instance depends on the parameters updated with the previous instance.

Paskin (2001) proposed a variant of the inside-outside algorithm (Baker 1979), which was
applied to the grammatical bigram model (Eisner 1996). Using this algorithm, the grammatical
bigram model can be learned by off-line learning algorithms. However, the grammatical bigram
model is based on a strong independence assumption that all the dependency edges of a tree are
independent of one another. This assumption restricts the model to first-order factorization
(single edges), losing much of the contextual information in the dependency tree. Chen et al. (2010)
showed that a wide range of decision history can lead to significant improvements in accuracy
for graph-based dependency parsing models. Meanwhile, several previous works (Carreras 2007;
Koo and Collins 2010) have shown that grandchild interactions provide important information
for dependency parsing. Therefore, relaxing the independence assumption to admit higher-order parts
that capture richer contextual information within the dependency tree is a reasonable
improvement of the bigram model.

In this paper, we present a generalized probabilistic model that can be applied to any
type of factored model for projective dependency parsing, and an algorithmic framework
for learning these statistical models. We use the grammatical bigram model as the backbone,
but relax the independence assumption and extend the inside-outside algorithm to
efficiently compute the partition functions and marginals (see Section 2.4) for three
higher-order models. With the proposed framework, parallel computation can be employed,
significantly reducing the time taken to train the parsing models. For the empirical
evaluation of our parsers, these algorithms are implemented and evaluated on three
treebanks: the Penn WSJ Treebank (Marcus, Santorini, and Marcinkiewicz 1993) for English, the Penn
Chinese Treebank (Xue et al. 2005) for Chinese and the Prague Dependency Treebank (Hajič 1998;
Hajič et al. 2001) for Czech, and we expect to achieve an improvement in parsing performance.
We also give an error analysis on structural properties for the parsers trained by our framework
and those trained by online learning algorithms. A free distribution of our implementation is
available online (see footnote 1).

The remainder of this paper is structured as follows: Section 2 describes the probabilistic
models and the algorithmic framework for training the models. Related work is presented in
Section 3. Section 4 presents the algorithms of different parsing models for computing partition
functions and marginals. The details of the experiments are reported in Section 5, and conclusions
are in Section 6.


2. Dependency Parsing 

2.1 Background of Dependency Parsing 

Dependency trees represent syntactic relationships through labeled directed edges between words and
their syntactic modifiers. For example, Figure 1 shows a dependency tree for the sentence
"Economic news had little effect on financial markets", with the sentence's root-symbol as its root.


1 http://sourceforge.net/projects/maxparser/
[Figure 1: (a) the dependency tree with edge labels such as sbj, obj and nmod; (b) the same tree
drawn over the word sequence "Root Economic news had little effect on financial markets".]

Figure 1

An example dependency tree.


Depending on whether crossing dependencies occur, dependency trees fall into two
categories: projective and non-projective dependency trees. A convenient
formulation of the projectivity constraint is that a dependency tree is projective if it can be
written with all words in a predefined linear order and all edges drawn on the plane above the
words without crossing edges (see Figure 1(b)). The example in Figure 1 belongs to the class of
projective dependency trees, where crossing dependencies are not allowed.

Dependency trees are often typed with labels for each edge to represent additional syntactic
information (see Figure 1(a)), such as sbj and obj for verb-subject and verb-object head-modifier
interactions, respectively. Sometimes, however, the dependency labels are omitted. Dependency
trees are called labeled or unlabeled according to whether the dependency labels are included
or dropped. In the remainder of this paper, we focus on unlabeled dependency parsing for
both theoretical and practical reasons. From a theoretical perspective, unlabeled parsers are easier to
describe and understand, and algorithms for unlabeled parsing can usually be extended easily
to the labeled case. From a practical perspective, algorithms for labeled parsing generally have higher
computational complexity than their unlabeled counterparts, and are more difficult to implement
and verify. Finally, the dependency labels can be accurately tagged by a two-stage labeling
method (McDonald 2006), utilizing the unlabeled output parse.

2.2 Probabilistic Model 

The notation used in this paper is as follows: $x$ represents a generic input
sentence, and $y$ represents a generic dependency tree. $T(x)$ denotes the set of possible
dependency trees for sentence $x$. The probabilistic model for dependency parsing defines a
family of conditional probabilities $\Pr(y|x)$ over all $y$ given sentence $x$, with a log-linear form:

$$\Pr(y|x) = \frac{1}{Z(x)} \exp\Big\{ \sum_j \lambda_j F_j(y, x) \Big\}$$

where $F_j$ are feature functions, $\lambda = (\lambda_1, \lambda_2, \ldots)$ are parameters of the model, and $Z(x)$ is a
normalization factor, which is commonly referred to as the partition function:

$$Z(x) = \sum_{y \in T(x)} \exp\Big\{ \sum_j \lambda_j F_j(y, x) \Big\}$$
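To make the log-linear form concrete, the following is a minimal sketch that evaluates the conditional probabilities and the log partition function when the candidate trees can be listed explicitly. The function name `conditional_probs` and the toy feature vectors are assumptions made for illustration only; a real parser never enumerates the candidate set, which is the problem the rest of the paper addresses.

```python
import math

def conditional_probs(feature_vectors, lam):
    """Return (Pr(y|x) for every candidate y, log Z(x)).

    feature_vectors[i] holds the global feature values F_j(y_i, x) of the
    i-th candidate tree; lam holds the parameters lambda_j.
    """
    scores = [sum(l * f for l, f in zip(lam, fv)) for fv in feature_vectors]
    shift = max(scores)  # log-sum-exp shift for numerical stability
    exp_scores = [math.exp(s - shift) for s in scores]
    z_shifted = sum(exp_scores)
    probs = [e / z_shifted for e in exp_scores]
    log_z = shift + math.log(z_shifted)
    return probs, log_z

# Three hypothetical candidate trees described by two features each.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs, log_z = conditional_probs(candidates, [0.5, -0.2])
print(probs, log_z)
```

Subtracting the maximum score before exponentiating does not change the probabilities; it only guards against overflow for large scores.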
2.3 Maximum Likelihood Parameter Inference

Maximum conditional likelihood estimation is used for model training (as in a CRF). For a set of
training data $\{(x_k, y_k)\}$, the logarithm of the likelihood, known as the log-likelihood, is given
by:

$$L(\lambda) = \log \prod_k \Pr(y_k|x_k) = \sum_k \Big[ \sum_j \lambda_j F_j(y_k, x_k) - \log Z(x_k) \Big]$$


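Maximum likelihood training chooses parameters such that the log-likelihood $L(\lambda)$ is
maximized. This optimization problem is typically solved using quasi-Newton numerical methods
such as L-BFGS (Nash and Nocedal 1991), which require the gradient of the objective
function:

$$\begin{aligned}
\frac{\partial L(\lambda)}{\partial \lambda_j}
 &= \sum_k \frac{\partial \log \Pr(y_k|x_k)}{\partial \lambda_j} \\
 &= \sum_k \Big[ F_j(y_k, x_k) - \frac{\partial \log Z(x_k)}{\partial \lambda_j} \Big] \\
 &= \sum_k F_j(y_k, x_k) - \sum_k \sum_{y \in T(x_k)} \Pr(y|x_k) F_j(y, x_k) \qquad (1)
\end{aligned}$$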


The computation of $Z(x)$ and of the second term in Equation (1) are the difficult
parts of model training. In the following, we show how these can be computed efficiently
using the proposed algorithms.
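The gradient in Equation (1) is the difference between observed and expected feature values. The sketch below illustrates this on a toy problem in which every training sentence comes with an explicit, very small candidate set; the function name `log_likelihood_and_gradient` and the toy data are assumptions for illustration, not part of the described implementation.

```python
import math

def log_likelihood_and_gradient(data, lam):
    """data: list of (candidate feature vectors, index of the gold tree)."""
    ll = 0.0
    grad = [0.0] * len(lam)
    for cands, gold in data:
        scores = [sum(l * f for l, f in zip(lam, fv)) for fv in cands]
        shift = max(scores)
        log_z = shift + math.log(sum(math.exp(s - shift) for s in scores))
        ll += scores[gold] - log_z
        # Observed feature values F_j(y_k, x_k).
        for j, f in enumerate(cands[gold]):
            grad[j] += f
        # Expected feature values: sum_y Pr(y|x_k) F_j(y, x_k).
        for fv, s in zip(cands, scores):
            p = math.exp(s - log_z)
            for j, f in enumerate(fv):
                grad[j] -= p * f
    return ll, grad

# Two hypothetical training sentences with explicit candidate sets.
toy_data = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 0),
    ([[2.0, 1.0], [0.0, 0.0]], 1),
]
lam = [0.3, -0.1]
ll, grad = log_likelihood_and_gradient(toy_data, lam)
print(ll, grad)
```

A function of this shape, returning the objective and its gradient together, is the usual interface expected by quasi-Newton optimizers; because the sum over training instances decomposes, the per-sentence terms can be computed in parallel.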


2.4 Problems of Training and Decoding 

In order to train and decode dependency parsers, we have to solve three inference problems which
are central to the algorithms proposed in this paper.

The first problem is the decoding problem of finding the best parse for a sentence when
all the parameters of the probabilistic model are given. According to decision theory, a
reasonable solution for classification is the Bayes classifier, which assigns the most probable
class under the conditional distribution. Dependency parsing can be regarded as a classification
problem, so decoding a dependency parser is equivalent to finding the dependency tree $y^*$ which
has the maximum conditional probability:

$$\begin{aligned}
y^* &= \operatorname*{argmax}_{y \in T(x)} \Pr(y|x) \\
    &= \operatorname*{argmax}_{y \in T(x)} \log \Pr(y|x) \\
    &= \operatorname*{argmax}_{y \in T(x)} \Big\{ \sum_j \lambda_j F_j(y, x) \Big\}. \qquad (2)
\end{aligned}$$

The second and third problems are the computation of the partition function $Z(x)$ and the
gradient of the log-likelihood (see Equation (1)).

From the definitions above, we can see that all three problems require an exhaustive search
over $T(x)$ to accomplish a maximization or summation. The cardinality of
$T(x)$ grows exponentially with the length of $x$, so it is impractical to perform the search
directly. A common strategy is to factor dependency trees into sets of small parts that have
limited interactions:

$$F_j(y, x) = \sum_{p \in y} f_j(p, x). \qquad (3)$$

That is, a dependency tree $y$ is treated as a set of parts $p$, and each feature function $F_j(y, x)$ is
equal to the sum of the part features $f_j(p, x)$.

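We denote the weight of each part $p$ as follows:

$$w(p, x) = \exp\Big\{ \sum_j \lambda_j f_j(p, x) \Big\}$$

Based on Equation (3) and the definition of the weight of each part, the conditional probability
$\Pr(y|x)$ has the following form:

$$\begin{aligned}
\Pr(y|x) &= \frac{1}{Z(x)} \exp\Big\{ \sum_j \lambda_j \sum_{p \in y} f_j(p, x) \Big\} \\
         &= \frac{1}{Z(x)} \exp\Big\{ \sum_{p \in y} \sum_j \lambda_j f_j(p, x) \Big\} \\
         &= \frac{1}{Z(x)} \prod_{p \in y} w(p, x)
\end{aligned}$$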

Furthermore, Equation (2) can be rewritten as:

$$y^* = \operatorname*{argmax}_{y \in T(x)} \sum_{p \in y} \log w(p, x),$$

and the partition function $Z(x)$ and the second term in the summation of Equation (1) are

$$Z(x) = \sum_{y \in T(x)} \prod_{p \in y} w(p, x)$$














Technical Report 


Year 2012 


and

$$\begin{aligned}
\sum_{y \in T(x_k)} \Pr(y|x_k) F_j(y, x_k)
 &= \sum_{y \in T(x_k)} \sum_{p \in y} \Pr(y|x_k) f_j(p, x_k) \\
 &= \sum_{p \in P(x_k)} \sum_{y \in T(p, x_k)} f_j(p, x_k) \Pr(y|x_k) \\
 &= \sum_{p \in P(x_k)} f_j(p, x_k) \sum_{y \in T(p, x_k)} \Pr(y|x_k),
\end{aligned}$$

where $T(p, x) = \{ y \in T(x) \mid p \in y \}$ and $P(x)$ is the set of all possible parts $p$ for sentence $x$.
Note that the remaining problem for the computation of the gradient in Equation (1) is to compute
the marginal probability $m(p)$ for each part $p$:

$$m(p) = \sum_{y \in T(p, x)} \Pr(y|x).$$


Then the three inference problems are as follows:

Problem 1: Decoding

$$y^* = \operatorname*{argmax}_{y \in T(x)} \sum_{p \in y} \log w(p, x).$$

Problem 2: Computing the Partition Function

$$Z(x) = \sum_{y \in T(x)} \prod_{p \in y} w(p, x).$$

Problem 3: Computing the Marginals

$$m(p) = \sum_{y \in T(p, x)} \Pr(y|x), \quad \text{for all } p.$$
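For very short sentences the three problems can be solved by direct enumeration, which is useful as a reference when checking an efficient implementation. The sketch below does this for first-order parts (single edges) only; the helper names and the `weights[h][m]` layout are assumptions made for this illustration, with index 0 standing for the root symbol and multi-root trees allowed.

```python
from itertools import product
from math import prod

def _reaches_root(m, heads):
    """Follow head pointers from word m; fail on a cycle."""
    seen = set()
    while m != 0:
        if m in seen:
            return False
        seen.add(m)
        m = heads[m - 1]
    return True

def projective_trees(n):
    """Yield head tuples; heads[m - 1] is the head of word m (0 = root symbol)."""
    for heads in product(range(n + 1), repeat=n):
        if not all(_reaches_root(m, heads) for m in range(1, n + 1)):
            continue
        spans = [(min(h, m), max(h, m)) for m, h in enumerate(heads, start=1)]
        if any(a < c < b < d for a, b in spans for c, d in spans):
            continue  # two edges cross, so the tree is not projective
        yield heads

def brute_force(weights, n):
    """Solve Problems 1-3 by enumeration; weights[h][m] is w(p, x) for edge h -> m."""
    z, best, best_w = 0.0, None, -1.0
    marg = [[0.0] * (n + 1) for _ in range(n + 1)]
    for heads in projective_trees(n):
        w = prod(weights[h][m] for m, h in enumerate(heads, start=1))
        z += w                                  # Problem 2
        if w > best_w:                          # Problem 1
            best, best_w = heads, w
        for m, h in enumerate(heads, start=1):  # Problem 3 (unnormalized)
            marg[h][m] += w
    return best, z, [[v / z for v in row] for row in marg]

n = 3
ones = [[1.0] * (n + 1) for _ in range(n + 1)]
best, z, marg = brute_force(ones, n)
print(best, z)
```

The enumeration visits (n + 1)^n head assignments, so it is only practical for a handful of words; it is meant as a correctness oracle, not as a parsing method.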


2.5 Discussion 

It should be noted that for parsers trained by online learning algorithms such as AP or MIRA,
only the algorithm for solving the decoding problem is required. However, in order to
train parsers using off-line parameter estimation methods such as the maximum likelihood estimation
described above, we have to carefully design algorithms for inference Problems 2 and 3.

The proposed probabilistic model generalizes to any type of part $p$, and
can be learned using the framework that solves the three inference problems. For different
types of factored models, the algorithms that solve the three inference problems are different.
Following Koo and Collins (2010), the order of a part is defined as the number of dependencies
it contains, and the order of a factorization or parsing algorithm is the maximum of the orders
of the parts it uses. In this paper, we focus on three factorizations: sibling and grandchild, two
different second-order parts, and grand-sibling, a third-order part:


[Figure: schematic drawings of the sibling, grandchild, and grand-sibling parts.]


In this paper, we consider only projective trees, where crossing dependencies are not
allowed, and exclude non-projective trees, where dependencies are allowed to cross. For projective
parsing, efficient algorithms exist to solve the three problems for certain factorizations with
special structure. Non-projective parsing with high-order factorizations is known to be
NP-hard (McDonald and Pereira 2006; McDonald and Satta 2007). In addition, our
models capture multi-root trees, whose root-symbols have one or more children. A multi-root
parser is more robust to sentences that contain disconnected but coherent fragments, since it is
allowed to split up its analysis into multiple pieces.


2.6 Labeled Parsing 

Our probabilistic model is easily extended to include dependency labels. We denote by $L$ the set
of all valid dependency labels, and change the feature functions to take labels into account:

$$F_j(y, x) = \sum_{(p, l) \in y} f_j(p, l, x),$$

where $l$ is the vector of dependency labels of edges belonging to part $p$. We define the order of
$l$ as the number of labels $l$ contains, and denote it as $o(l)$. It should be noted that the order of $l$
is not necessarily equal to the order of $p$, since $l$ may contain the labels of only some of the edges in $p$. For
example, for the second-order sibling model and the part $(s, r, t)$, $l$ can be defined to contain only
the label of the edge from word $x_s$ to word $x_t$.

The weight function of each part is changed to:

$$w(p, l, x) = \exp\Big\{ \sum_j \lambda_j f_j(p, l, x) \Big\} \qquad (4)$$

Based on Equation (4), Problems 2 and 3 are rewritten as follows:

$$Z(x) = \sum_{y \in T(x)} \prod_{(p, l) \in y} w(p, l, x)$$

and

$$m(p, l) = \sum_{y \in T(p, l, x)} \Pr(y|x), \quad \text{for all } (p, l).$$

This extension increases the time complexity by a factor of $O(|L|^{o(l)})$, where $|L|$
is the size of $L$.
[Figure 2: dynamic-programming diagrams; surviving panel titles: second-order sibling parser,
third-order grand-sibling parser.]

Figure 2

The dynamic-programming structures and derivation of four graph-based dependency parsers with
different types of factorization. Symmetric right-headed versions are elided for brevity.


3. Related Work 

3.1 Grammatical Bigram Probability Model 

The probabilistic model described in Section 2.2 is a generalized formulation of the grammatical
bigram probabilistic model proposed in Eisner (1996), which is used by several
works (Paskin 2001; Koo et al. 2007; Smith and Smith 2007). In fact, the grammatical bigram
probabilistic model is a special case of our probabilistic model, obtained by specifying the parts $p$ as
individual edges. The grammatical bigram model is based on a strong independence assumption:
all the dependency edges of a tree are independent of one another, given the sentence $x$.

For the first-order model (part $p$ is an individual edge), a variant of the inside-outside
algorithm, which was proposed by Baker (1979) for probabilistic context-free grammars, can be
applied to compute the partition function and marginals for projective dependency
structures. This inside-outside algorithm is built on the semiring parsing framework (Goodman 1999).
For non-projective cases, Problems 2 and 3 can be solved by an adaptation of Kirchhoff's
Matrix-Tree Theorem (Koo et al. 2007; Smith and Smith 2007).

3.2 Algorithms of Decoding Problem for Different Factored Models 

It should be noted that if the score of a part is defined as the logarithm of its weight:

$$\mathrm{score}(p, x) = \log w(p, x) = \sum_j \lambda_j f_j(p, x),$$

then the decoding problem is equivalent to graph-based dependency parsing with a
global linear model (GLM), and several parsing algorithms for different factorizations have
been proposed in previous work. Figure 2 provides graphical specifications of these parsing
algorithms.

McDonald et al. (2005) presented the first-order dependency parser, which decomposes
a dependency tree into a set of individual edges. A widely-used dynamic programming
algorithm (Eisner 2000) was used for decoding. This algorithm introduces two
interrelated types of dynamic programming structures: complete spans and incomplete spans
(McDonald, Crammer, and Pereira 2005). Larger spans are created from two smaller, adjacent
spans by recursive combination in a bottom-up procedure.
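The same span recursions yield the partition function when maximization is replaced by summation and addition of scores by multiplication of weights. The following is a minimal sketch of such an inside pass for the first-order model, under the assumptions that index 0 is the root symbol, multi-root trees are allowed, and `weights[h][m]` (a name chosen here for illustration) holds the weight of the edge from head h to modifier m.

```python
def inside_partition(weights, n):
    """Return Z(x) for the first-order projective model over words 1..n."""
    size = n + 1
    comp_r = [[0.0] * size for _ in range(size)]  # complete span, head at left end s
    comp_l = [[0.0] * size for _ in range(size)]  # complete span, head at right end t
    inc_r = [[0.0] * size for _ in range(size)]   # incomplete span with edge s -> t
    inc_l = [[0.0] * size for _ in range(size)]   # incomplete span with edge t -> s
    for s in range(size):
        comp_r[s][s] = comp_l[s][s] = 1.0
    for k in range(1, size):          # span width
        for s in range(size - k):
            t = s + k
            # Two adjacent complete spans facing each other, split at r.
            joined = sum(comp_r[s][r] * comp_l[r + 1][t] for r in range(s, t))
            inc_r[s][t] = joined * weights[s][t]
            # The root symbol (index 0) can never be a modifier.
            inc_l[s][t] = joined * weights[t][s] if s > 0 else 0.0
            comp_r[s][t] = sum(inc_r[s][r] * comp_r[r][t] for r in range(s + 1, t + 1))
            comp_l[s][t] = sum(comp_l[s][r] * inc_l[r][t] for r in range(s, t))
    return comp_r[0][n]

# With all edge weights equal to 1, Z(x) counts the projective multi-root trees.
ones3 = [[1.0] * 4 for _ in range(4)]
print(inside_partition(ones3, 3))  # 12.0
```

The pass takes O(n^3) time and O(n^2) space; the result for three words with unit weights, 12, is the number of projective multi-root trees over three words.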

The second-order sibling parser (McDonald and Pereira 2006) breaks up a dependency tree
into sibling parts: pairs of adjacent edges with a shared head. Koo and Collins (2010) proposed a
parser that factors each dependency tree into a set of grandchild parts. Formally, a grandchild part
is a triple of indices $(g, s, t)$ where $g$ is the head of $s$ and $s$ is the head of $t$. In order to parse this
factorization, it is necessary to augment both complete and incomplete spans with grandparent
indices. Following Koo and Collins (2010), we refer to these augmented structures as g-spans.

The second-order parser proposed in Carreras (2007) is capable of scoring both sibling
and grandchild parts with complexities of $O(n^4)$ time and $O(n^3)$ space. However, the parser
suffers from a crucial limitation: it can only evaluate grandchild parts for outermost
grandchildren.

The third-order grand-sibling parser, which encloses grandchild and sibling parts into a
grand-sibling part, was described in Koo and Collins (2010). This factorization defines all
grandchild and sibling parts and still requires $O(n^4)$ time and $O(n^3)$ space.


3.3 Transition-based Parsing 


Another category of dependency parsing systems is "transition-based" parsing
(Nivre and Scholz 2004; Attardi 2006; McDonald and Nivre 2007), which parameterizes
models over transitions from one state to another in an abstract state machine. In these models,
dependency trees are constructed by taking the highest-scoring transition at each state until a
terminal state is entered. Parameters in these models are typically learned using standard
classification techniques to predict one transition from a set of possible transitions given a state
history.

Recently, several approaches have been proposed to improve transition-based dependency
parsers. For decoding, beam search (Johansson and Nugues 2007;
Huang, Jiang, and Liu 2009) and partial dynamic programming (Huang and Sagae 2010) have
been applied to improve one-best search. For training, global structural learning
has been applied to replace local learning on each decision (Zhang and Clark 2008;
Huang, Jiang, and Liu 2009).


4. Algorithms for High-order Models 

In this section, we describe our algorithms for Problems 2 and 3 for three high-order factored
models: grandchild and sibling, two second-order models; and grand-sibling, which is
third-order. Our algorithms build on the idea of the inside-outside algorithm (Paskin 2001) for
the first-order projective parsing model. Following this, we define the inside probabilities $\beta$ and
outside probabilities $\alpha$ over spans $\phi$:

$$\beta(\phi) = \sum_{t \in \phi} \prod_{p \in t} w(p, x)$$

$$\alpha(\phi) = \sum_{y \in T(\phi)} \prod_{p \notin y(\phi)} w(p, x),$$
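Algorithm 1

Compute inside probability $\beta$ for the second-order grandchild model

Require: $\beta(C^g_{s,s}) = 1.0 \;\; \forall g, s$
 1: for k = 1 to n
 2:   for s = 0 to n - k
 3:     t = s + k
 4:     for g < s or g > t
 5:       $\beta(I^g_{s,t}) = \sum_{s \le r < t} \beta(C^g_{s,r}) \cdot \beta(C^s_{t,r+1}) \cdot w^g_{s,t}$
          $\beta(I^g_{t,s}) = \sum_{s \le r < t} \beta(C^t_{s,r}) \cdot \beta(C^g_{t,r+1}) \cdot w^g_{t,s}$
 6:       $\beta(C^g_{s,t}) = \sum_{s < r \le t} \beta(I^g_{s,r}) \cdot \beta(C^s_{r,t})$
          $\beta(C^g_{t,s}) = \sum_{s \le r < t} \beta(I^g_{t,r}) \cdot \beta(C^t_{r,s})$
 7:     end for
 8:   end for

Require: $\beta(C_{s,s}) = 1.0 \;\; \forall s$
 9: for k = 1 to n
10:   s = n - k, t = k
11:   $\beta(I_{0,t}) = \sum_{0 \le r < t} \beta(C_{0,r}) \cdot \beta(C^0_{t,r+1}) \cdot w_{0,t}$
      $\beta(I_{n,s}) = \sum_{s \le r < n} \beta(C^n_{s,r}) \cdot \beta(C_{n,r+1}) \cdot w_{n,s}$
12:   $\beta(C_{0,t}) = \sum_{0 < r \le t} \beta(I_{0,r}) \cdot \beta(C^0_{r,t})$
      $\beta(C_{n,s}) = \sum_{s \le r < n} \beta(I_{n,r}) \cdot \beta(C^n_{r,s})$
13: end for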

where $t$ is a sub-structure of a tree and $y(\phi)$ is the sub-structure of tree $y$ that belongs to span $\phi$.

4.1 Model of Grandchild Factorization 

In the second-order grandchild model, each dependency tree is factored into a set of grandchild
parts: pairs of dependencies connected head-to-tail. Formally, a grandchild part is a triple of
indices $(g, s, t)$ where both $(g, s)$ and $(s, t)$ are dependencies.

In order to compute the partition function $Z(x)$ and marginals $m(g, s, t)$ for this
factorization, we augment both incomplete and complete spans with grandparent indices. This is
similar to the decoding algorithm of Koo and Collins (2010) for the grandchild factorization.
Following Koo and Collins (2010), we refer to these augmented structures as g-spans, and
denote an incomplete g-span as $I^g_{s,t}$, where $I_{s,t}$ is a normal incomplete span and $g$ is the index
of a grandparent lying outside the range $[s, t]$, with the implication that $(g, s)$ is a dependency.
Complete g-spans are defined analogously and denoted as $C^g_{s,t}$. In addition, we denote the weight
of a grandchild part $(g, s, t)$ as $w^g_{s,t}$ for brevity.

The algorithm for the computation of inside probabilities $\beta$ is shown as Algorithm 1. The
dynamic programming derivations resemble those of the decoding algorithm of this factorization;
the only difference is that maximization is replaced with summation. The reason is clear, since
Algorithm 2

Compute outside probability $\alpha$ for the second-order grandchild model


Require: = 1-0, Q;(/n,o) = 1-0 

1: for fc = n to 1 
2: s = n — k,t = k 

3: a(Co,t)= E /3(C70,+i) • a(V) • WoV 

t<ir<n 

4: a(/o,t)= E ■ a{Co,r) 

t<r<n 

5: end for 

Require: a(/o%) = 1.0, a(/;^,o) = 1-0 


a{C„,s) = E /3(C"s-l) • 0-{I„,r) ■ 

0<r<s 

a{I-a,s)= E 

0<r<s 


6 
7: 

8 
9 
10 : 

11 

12 

13 

14 

15 

16 

17 

18 

19: 

20 

21 

22 : 

23: 

24 

25: 

26 


for fc = n to 1 
for s = 0 to n — fc 
t = s + k 
for (? < s 


«(CE)= E E 

t<.r<n r<ig\/r'>t 

= E PiCr,s-l) ■ E /3(Cg,s-l) ' ' Wff.t 


5 f<r<s r<g'Vr>t 

ifp = 0 

a(CE) ± /3(/o,.) • a(Co,t) 

end if 

«aE)= E • a(C|, J 

t<r<n 

end for 
for p > f 


a(CE) - ^(Co.s-i) • a{Io,t) ■ 


«UE)= E /?(CE-a(CE) 

g<.r<s 


t<r<g 


r<is\/r>g 


«(c^E)= E E /?(EE) • 


0ieq'r<s 

ifp = n 

«(CE)±/?(/„,t+i).a(C„,,).<, 

end if 

«aE)= E /3(Q,J j 

t<r<g 


r<.s\/r'>g 




E /3(CE)-a(CE) 

0<r<s 


end for 
end for 
end for 
the spans defined for the two algorithms are the same. Note that since our algorithm considers
multi-root dependency trees, we should perform another recursive step to compute the inside
probability $\beta$ for the complete span $C_{0,n}$, after the computation of $\beta$ for all g-spans.

Algorithm 2 illustrates the algorithm for computing outside probabilities $\alpha$. This is a
top-down dynamic programming algorithm, and the key of this algorithm is to determine all the
contributions to the final $Z(x)$ for each g-span; fortunately, this can be done deterministically for
all cases. For example, the complete g-span $C^g_{s,t}$ with $g < s < t$ has two different contributions:
it is combined with a g-span $C^s_{r,t+1}$, of which $r > t$, on the right side to build up a larger g-span
$I^g_{s,r}$; or it is combined with a g-span $I^r_{g,s}$, of which $r > t$ or $r < g$, on the left side to form a
larger g-span $C^r_{g,t}$. So $\alpha(C^g_{s,t})$ is the sum of two terms, each of which corresponds to one of the
two cases (see Algorithm 2). It should be noted that complete g-spans $C^g_{s,t}$ with $g = 0$ or $g = n$ are
two special cases.

After the computation of $\beta$ and $\alpha$ for all spans, we can get marginals using the following
equation:

$$m(g, s, t) = \beta(I^g_{s,t}) \cdot \alpha(I^g_{s,t}) / Z(x).$$

Since the complexity of both Algorithm 1 and Algorithm 2 is $O(n^4)$ time and $O(n^3)$ space,
the overall complexity for training this model is $O(n^4)$ time and $O(n^3)$ space, which is the same
as the decoding algorithm of this factorization.

4.2 Model of Sibling Factorization 

In order to parse the sibling factorization, a new type of span, the sibling span, is
defined (McDonald 2006). We denote a sibling span as $S_{s,t}$, where $s$ and $t$ are successive modifiers
with a shared head. Formally, a sibling span $S_{s,t}$ represents the region between successive
modifiers $s$ and $t$ of some head. The graphical specification of the second-order sibling model
for dynamic programming, which originates in the work of Eisner (1996), is shown in
Figure 2. The key insight is that an incomplete span is constructed by combining a smaller
incomplete span with a sibling span that covers the region between the two successive modifiers.
This construction allows pairs of sibling dependents to be collected in a single state. Unsurprisingly,
the dynamic-programming structures and derivations of the algorithm for computing
$\beta$ are the same as those of the decoding algorithm, so we omit the pseudo-code of this algorithm.

The algorithm for computing $\alpha$ can be designed with the new dynamic programming
structures. The pseudo-code of this algorithm is given in Algorithm 3. We use $w_{s,r,t}$ to
denote the weight of a sibling part $(s, r, t)$. The computation of marginals of sibling parts is
quite different from that of the first-order dependency or second-order grandchild model. Because
of the introduction of sibling spans, two different cases should be considered: the modifiers are on the
left or on the right side of the head. In addition, the part $(s, -, t)$, which represents that $t$ is the
inner-most modifier of $s$, is a special case and should be treated separately. We can get marginals for all
sibling parts with $s < r < t$ as follows:

$$\begin{aligned}
m(s, r, t) &= \beta(I_{s,r}) \cdot \beta(S_{r,t}) \cdot \alpha(I_{s,t}) \cdot w_{s,r,t} / Z(x) \\
m(t, r, s) &= \beta(S_{s,r}) \cdot \beta(I_{t,r}) \cdot \alpha(I_{t,s}) \cdot w_{t,r,s} / Z(x) \\
m(s, -, t) &= \beta(C_{t,s+1}) \cdot \alpha(I_{s,t}) \cdot w_{s,-,t} / Z(x) \\
m(t, -, s) &= \beta(C_{s,t-1}) \cdot \alpha(I_{t,s}) \cdot w_{t,-,s} / Z(x)
\end{aligned}$$

Since each derivation is defined by a span and a split point, the complexity for training and
decoding of the second-order sibling model is $O(n^3)$ time and $O(n^2)$ space.
Algorithm 3

Compute outside probability $\alpha$ for the second-order sibling model

Require: a(C'o,„) = 1.0 a(C„^o) = 1-0 
1: for fc = n to 1 
2: for s = 0 to n — k 
3: t = s + k 

4: Ck(Ss^t) — ^ /5(.^r,s) ' ^ ‘ ^{^Ir,s) ' 

0<r<s t<,r<n 

5: a{Cs,t)= E P{Cr,t+i)-a{S,,r)+ E I3{lr,s) ■ a{Cr,t) 

t<r<n 0<r<s 

: + /3{Ct+i^t+i) • o:{It+i^s) ■ tft-i-i.-.s 

6: a{Ct,s)= E ^iCr,s-i) ■ a{Sr,t) + E /3(/r.t) • ct(a,.) 

0<r<s t<.r<n 

• H“ —l,s—l) ' — — 

7: a(/g t) = ^ /3(<5't,r) ■ a(7^s,r) • + Et<r<n^(C'r,t) ■ a(C's,r) 

t<.r<n 

8: a(/t,s) = Eo<r<s l^i.Sr,s) ■ a{It,r) ■ Wt,s,r + Eo<r<s ^(C's.r) ’ a{Ct^r) 

9 : end for 
10: end for 


4.3 Model of Grand-Sibling Factorization 

We now describe the algorithms of the third-order grand-sibling model. In this model, each tree
is decomposed into grand-sibling parts, which enclose grandchild and sibling parts. Formally, a
grand-sibling is a 4-tuple of indices $(g, s, r, t)$ where $(s, r, t)$ is a sibling part and $(g, s, t)$ is a
grandchild part. The algorithm of this factorization can be designed based on the algorithms for
the grandchild and sibling models.

Just as the second-order sibling model extends the first-order dependency model,
we define sibling g-spans $S^g_{s,t}$, where $S_{s,t}$ is a normal sibling span and $g$ is the index of the
head of $s$ and $t$, which lies outside the region $[s, t]$ with the implication that $(g, s, t)$ forms a valid
sibling part. This model can also be treated as an extension of the sibling model by augmenting
it with a grandparent index for each span, just as the grandchild model extends the
first-order dependency model. Figure 2 also provides the graphical specification of this factorization for
dynamic programming. The overall structures and derivations are similar to those of the second-order
sibling model, with the addition of grandparent indices. As in the second-order grandchild
model, the grandparent indices can be set deterministically in all cases.

The pseudo-code of the algorithm for the computation of the outside probability $\alpha$ is
given in Algorithm 4. It should be noted that in this model there are two types of special
cases: one is the sibling g-span $S^g_{s,t}$ with $g = 0$ or $g = n$, like the complete g-span $C^g_{s,t}$ with
$g = 0$ or $g = n$ in the second-order grandchild model; the other is the inner-most modifier case,
as in the second-order sibling model. We use $w^g_{s,r,t}$ to denote the weight of a grand-sibling part
Algorithm 4

Compute outside probability $\alpha$ for the third-order grand-sibling model


Require: «(/o,n) = 1-0, a(/n,o) = 1-0, a(Co,n) = 1-0, a(C„,o) = 1-0 
1: for fc = n to 1 
2: s = n ~ k,t = k 

3: a(/o,t) = /3(Ct°„)-a(Co.„)-f E /3(S0,) • a(/o,r) • 

t<.r<n 

4: a(/„,,) = /3(CoE)-«(C„.o)-f E /3(S"J 

0<r <s 

5: end for 

Require: = 1.0, «(/" q) = 1-0 


for fc = n to 1 
for s = 0 to n — fc 
t = s -\- k 

ioY g < s 


10: 

11 : 


r<g\/r>t 


if 9 = 0 t) - /3(fo,s) • o(fo.t) • ■u’o.s.t 


t<r<n r<gyr>t 


12: «(C3J= E if<? = s-l E /3(Cf,.) ■ 0(7-,) ' 

gdrCis r<5Vr>t 

13: «(4®t)= E /3(S,V).a(7|,4-<t,,+ E ■ a(Clr) 

t<.r<n t<r<n 

14: E )■<.,.+ E /3(Ci,,)-a(C«,) 

g<r<s g<r<s 

15: end for 

16: for g > t 


17: 

18: 

19: 

20: 

21 : 


«(SL)= E 


r<.sVr>g 


g,s ) ‘ ^g,t,s 


itg = n a{S?t) = l3{I„,t)-a{I„_s)-w” 


»{Cl,)= E PiC^,t+i) ■ »iSlr) ifg = t + l E /3(Cf,t) ■ a(7,E) . 

t<r<g r<sVr>g 

C‘{cls)= E + E 

0<r<s r<isVr^g 

«(4,t)= E /3(5tV) -oaf,.) •<*,.+ E f}{Clr)-»{Cl,r) 

tarKg t<r<C.g 

»(lls)= E ■<,,,+ E /3(C*,,) ■ a(c»,) 

0<r<s 0<r<s 


22: end for 

23: end for 
24: end for 
Table 1

Training, development and test data for PTB, CTB and PDT. #sentences and #words refer to the
number of sentences and the number of words excluding punctuation in each data set, respectively.

               sections               #sentences   #words
PTB  Training  2-21                   39,832       843,029
     Dev       22                     1,700        35,508
     Test      23                     2,416        49,892
CTB  Training  001-815; 1001-1136     16,079       370,777
     Dev       886-931; 1148-1151     804          17,426
     Test      816-885; 1137-1147     1,915        42,773
PDT  Training  -                      73,088       1,255,590
     Dev       -                      7,318        126,028
     Test      -                      7,507        125,713

{g, s, r, t) and the marginals for all grand-sibling parts with s < r < t can be computed as 
follows: 


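$$m(g,s,r,t) = \beta(I^{g}_{s,r}) \cdot \beta(S^{s}_{r,t}) \cdot \alpha(I^{g}_{s,t}) \cdot w^{g}_{s,r,t} / z(x)$$

$$m(g,t,r,s) = \beta(I^{g}_{t,r}) \cdot \beta(S^{t}_{s,r}) \cdot \alpha(I^{g}_{t,s}) \cdot w^{g}_{t,r,s} / z(x)$$

$$m(g,s,-,t) = \beta(C^{s}_{t,s+1}) \cdot \alpha(I^{g}_{s,t}) \cdot w^{g}_{s,-,t} / z(x)$$

$$m(g,t,-,s) = \beta(C^{t}_{s,t-1}) \cdot \alpha(I^{g}_{t,s}) \cdot w^{g}_{t,-,s} / z(x)$$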

Despite the extension to third-order parts, each derivation is still defined by a g-span and a split
point, as in the second-order grandchild model, so training and decoding of the grand-sibling model
still require $O(n^4)$ time and $O(n^3)$ space.

5. Experiments for Dependency Parsing 

5.1 Data Sets 


We implement and evaluate the proposed algorithms for the three factored models
(sibling, grandchild and grand-sibling) on the Penn English Treebank (PTB version
3.0) (Marcus, Santorini, and Marcinkiewicz 1993), the Penn Chinese Treebank (CTB version
5.0) (Xue et al. 2005) and the Prague Dependency Treebank (PDT) (Hajič 1998; Hajič et al. 2001).

For English, the PTB data is prepared using the standard split: sections 2-21 are used for
training, section 22 for development, and section 23 for testing. Dependencies are extracted
using the Penn2Malt tool² with standard head rules (Yamada and Matsumoto 2003). For Chinese,
we adopt the data split from Zhang and Clark (2009), and we also use the Penn2Malt tool
to convert the data into dependency structures. Since the dependency trees for English and
Chinese are extracted from phrase structures in the Penn Treebanks, they contain no crossing edges
by construction. For Czech, the PDT has a predefined training, development and testing split; we
"projectivized" the training data by finding the best-match projective trees.³




2 http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html

3 Projective trees for training sentences are obtained by running the first-order projective parser with an oracle model
that assigns a score of +1 to correct edges and -1 otherwise.
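The projectivization step in footnote 3 can be illustrated with a small sketch. The code below is not the authors' implementation; it assumes the standard first-order projective decoder of Eisner, a head array with an artificial root at index 0, and hypothetical function names (`oracle_scores`, `eisner`, `projectivize`, `is_projective`). The oracle assigns +1 to gold edges and -1 to all others, so the highest-scoring projective tree keeps as many gold edges as possible.

```python
NEG = float("-inf")

def oracle_scores(gold_heads):
    """score[h][m] = +1 if h is the gold head of m, else -1; no edge may enter the root (index 0)."""
    n = len(gold_heads)
    score = [[-1.0] * n for _ in range(n)]
    for m in range(1, n):
        score[gold_heads[m]][m] = 1.0
    for h in range(n):
        score[h][0] = NEG
    return score

def eisner(score):
    """First-order projective decoding; returns a head array with heads[0] = -1."""
    n = len(score)
    # Direction 0: head is the right end of the span; direction 1: head is the left end.
    comp = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    inc = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    comp_bp = [[[None, None] for _ in range(n)] for _ in range(n)]
    inc_bp = [[None] * n for _ in range(n)]
    for k in range(1, n):
        for s in range(n - k):
            t = s + k
            # Incomplete spans: join two complete spans and add the edge between s and t.
            best, arg = max((comp[s][r][1] + comp[r + 1][t][0], r) for r in range(s, t))
            inc[s][t][0] = best + score[t][s]
            inc[s][t][1] = best + score[s][t]
            inc_bp[s][t] = arg
            # Complete spans: an incomplete span plus a complete span with the same direction.
            comp[s][t][0], comp_bp[s][t][0] = max(
                (comp[s][r][0] + inc[r][t][0], r) for r in range(s, t))
            comp[s][t][1], comp_bp[s][t][1] = max(
                (inc[s][r][1] + comp[r][t][1], r) for r in range(s + 1, t + 1))
    heads = [-1] * n

    def backtrack(s, t, d, complete):
        if s == t:
            return
        if not complete:
            r = inc_bp[s][t]
            if d == 0:
                heads[s] = t
            else:
                heads[t] = s
            backtrack(s, r, 1, True)
            backtrack(r + 1, t, 0, True)
        elif d == 0:
            r = comp_bp[s][t][0]
            backtrack(s, r, 0, True)
            backtrack(r, t, 0, False)
        else:
            r = comp_bp[s][t][1]
            backtrack(s, r, 1, False)
            backtrack(r, t, 1, True)

    backtrack(0, n - 1, 1, True)
    return heads

def projectivize(gold_heads):
    """Best-match projective tree under the +1/-1 oracle."""
    return eisner(oracle_scores(gold_heads))

def is_projective(heads):
    """With the root at position 0, a tree is projective iff no two edges cross."""
    arcs = [(min(h, m), max(h, m)) for m, h in enumerate(heads) if m > 0]
    return not any(a < c < b < d for a, b in arcs for c, d in arcs)
```

When several projective trees retain the same number of gold edges, this sketch returns one of them; how ties are broken in the original experiments is not stated in the text.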
All experiments were run on every sentence in each data set, regardless of
length. Parsing accuracy is measured with the unlabeled attachment score (UAS), the percentage
of words with the correct head; root accuracy (RA), the percentage of correctly identified root
words; and the percentage of complete matches (CM). Following previous work,
we did not include punctuation⁴ in the calculation of accuracies for English and Chinese.
Detailed information on each treebank is shown in Table 1.
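The three measures can be computed as in the following sketch. It is an illustration under stated assumptions rather than the evaluation script used in the experiments: sentences are head arrays with an artificial root at index 0, `punct_masks` is a hypothetical per-token flag for punctuation, RA is taken over tokens whose gold head is the root, and CM is judged on scored (non-punctuation) tokens only.

```python
def attachment_scores(gold_sents, pred_sents, punct_masks):
    """Return (UAS, RA, CM) as percentages.

    gold_sents / pred_sents: lists of head arrays; index 0 is the artificial root.
    punct_masks: parallel lists of booleans, True for tokens excluded from scoring.
    """
    correct = total = 0
    root_correct = root_total = 0
    complete = 0
    for gold, pred, punct in zip(gold_sents, pred_sents, punct_masks):
        all_ok = True
        for m in range(1, len(gold)):
            if punct[m]:
                continue  # punctuation is not scored
            ok = gold[m] == pred[m]
            total += 1
            correct += ok
            all_ok = all_ok and ok
            if gold[m] == 0:  # gold root word
                root_total += 1
                root_correct += ok
        complete += all_ok
    uas = 100.0 * correct / total
    ra = 100.0 * root_correct / root_total
    cm = 100.0 * complete / len(gold_sents)
    return uas, ra, cm
```

For languages where punctuation is scored, the mask would simply be all `False`.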

5.2 Feature Space 

Following previous work on high-order dependency parsing (McDonald and Pereira 2006;
Carreras 2007; Koo and Collins 2010), higher-order factored models capture not only features
associated with the corresponding higher-order parts, but also the features of the relevant lower-order
parts enclosed in their factorization. For example, the third-order grand-sibling model evaluates
parts for dependencies, siblings, grandchildren and grand-siblings, so that the feature function of
a dependency parse is given by:

$$F(y, x) = \sum_{(s,t) \in y} f_{dep}(s, t, x) + \sum_{(s,r,t) \in y} f_{sib}(s, r, t, x) + \sum_{(g,s,t) \in y} f_{gch}(g, s, t, x) + \sum_{(g,s,r,t) \in y} f_{gsib}(g, s, r, t, x)$$

where $f_{dep}$, $f_{sib}$, $f_{gch}$, and $f_{gsib}$ are the feature functions of dependency, sibling, grandchild, and
grand-sibling parts.
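To make the factorization concrete, the sketch below enumerates the four kinds of parts in a tree and sums their features. It is an assumption-laden illustration, not the authors' code: the tree is a head array with an artificial root at index 0, `None` stands for the missing inner sibling of a first child (written "-" in the marginals above), the root's own modifiers contribute no grandchild or grand-sibling parts, and `extract_parts` and `feature_vector` are hypothetical names.

```python
from collections import Counter

def extract_parts(heads):
    """Enumerate dependency, sibling, grandchild and grand-sibling parts of a tree."""
    n = len(heads)
    parts = {"dep": [], "sib": [], "gch": [], "gsib": []}
    for s in range(n):
        g = heads[s]  # grandparent of s's modifiers; -1 when s is the artificial root
        left = [m for m in range(s - 1, 0, -1) if heads[m] == s]   # inner to outer
        right = [m for m in range(s + 1, n) if heads[m] == s]      # inner to outer
        for mods in (left, right):
            r = None  # no inner sibling yet
            for t in mods:
                parts["dep"].append((s, t))
                parts["sib"].append((s, r, t))
                if s != 0:
                    parts["gch"].append((g, s, t))
                    parts["gsib"].append((g, s, r, t))
                r = t
    return parts

def feature_vector(heads, feature_fns):
    """F(y, x): sum the features of every part; feature_fns maps a part kind to a
    function that returns an iterable of feature names for that part."""
    parts = extract_parts(heads)
    F = Counter()
    for kind, fn in feature_fns.items():
        for part in parts[kind]:
            F.update(fn(*part))
    return F

# Toy feature functions that fire one feature per part, named after the part kind.
toy_fns = {kind: (lambda *part, kind=kind: [kind]) for kind in ("dep", "sib", "gch", "gsib")}
```

In a real parser each function would close over the sentence x and emit the templates of Table 2; the toy functions here only count parts.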

First-order dependency features $f_{dep}$, second-order sibling features $f_{sib}$ and
second-order grandchild features $f_{gch}$ are based on feature sets from previous
work (McDonald, Crammer, and Pereira 2005; McDonald and Pereira 2006; Carreras 2007), to
which we added lexicalized versions of several features. For instance, our first-order feature
set contains lexicalized "in-between" features that recognize word types occurring between
the head and modifier words in an attachment decision, while previous work has defined
in-between features only for POS tags. As another example, the second-order features $f_{sib}$ and
$f_{gch}$ contain lexical trigram features, which are also absent from the feature sets of previous work.
The third-order grand-sibling features are based on Koo and Collins (2010).
All feature templates used in our parsers are outlined in Table 2.

According to Table 2, several features in our parser depend on part-of-speech (POS)
tags of input sentences. For English, POS tags are automatically assigned by the SVMTool
tagger (Gimenez and Marquez 2004); for Chinese, we used the gold-standard POS tags in CTB.
Following Koo and Collins (2010), two versions of POS tags are used for any feature that involves
POS: the normal POS tags and a coarsened version of the POS tags.⁵


4 English evaluation ignores any token whose gold-standard POS is one of {`` '' : , .}; Chinese evaluation ignores any
token whose tag is "PU".

5 For English, we used the first two characters, except PRP and PRP$; for Czech, we used the first character of the
tag; for Chinese, we dropped the last character, except PU and CD.
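The coarsening rules in footnote 5 translate directly into code. The sketch below follows the footnote as written; the function name `coarsen_pos`, the language labels, and the handling of one-character Chinese tags (left unchanged so that the coarse tag is never empty) are assumptions not specified in the text.

```python
def coarsen_pos(tag, lang):
    """Map a POS tag to its coarse version following footnote 5."""
    if lang == "english":
        # First two characters, except PRP and PRP$.
        return tag if tag in ("PRP", "PRP$") else tag[:2]
    if lang == "czech":
        # First character of the tag.
        return tag[:1]
    if lang == "chinese":
        # Drop the last character, except PU and CD.
        # Assumption: one-character tags are kept as they are.
        if tag in ("PU", "CD") or len(tag) <= 1:
            return tag
        return tag[:-1]
    raise ValueError("unknown language: " + lang)
```

Each POS-based template would then be instantiated twice, once with the original tags and once with `coarsen_pos` applied.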
Table 2

All feature templates of different factorizations used by our parsing algorithms. L(·) and P(·) are the
lexicon and POS tag of each token.

dependency features for part (s, t)
  uni-gram features:
    L(s)·P(s); L(s); P(s); L(t)·P(t); L(t); P(t)
  bi-gram features:
    L(s)·P(s)·L(t)·P(t); L(s)·P(s)·P(t); L(s)·P(s)·L(t); L(s)·L(t);
    P(s)·L(t)·P(t); L(s)·L(t)·P(t); P(s)·P(t)
  context features:
    P(s)·P(t)·P(s+1)·P(t-1); P(s)·P(t)·P(s-1)·P(t-1);
    P(s)·P(t)·P(s+1)·P(t+1); P(s)·P(t)·P(s+1)·P(t-1)
  in-between features:
    L(s)·L(b)·L(t); P(s)·P(b)·P(t)

grandchild features for part (g, s, t)
  tri-gram features:
    L(g)·L(s)·L(t); P(g)·P(s)·P(t); L(g)·P(g)·P(s)·P(t);
    P(g)·L(s)·P(s)·P(t); P(g)·P(s)·L(t)·P(t)
  backed-off features:
    L(g)·L(t); P(g)·P(t); L(g)·P(t); P(g)·L(t)

sibling features for part (s, r, t)
  tri-gram features:
    L(s)·L(r)·L(t); P(s)·P(r)·P(t); L(s)·P(s)·P(r)·P(t);
    P(s)·L(r)·P(r)·P(t); P(s)·P(r)·L(t)·P(t)
  backed-off features:
    L(r)·L(t); P(r)·P(t); L(r)·P(t); P(r)·L(t)

grand-sibling features for part (g, s, r, t)
  4-gram features:
    L(g)·P(s)·P(r)·P(t); P(g)·L(s)·P(r)·P(t); P(g)·P(s)·L(r)·P(t); P(g)·P(s)·P(r)·L(t);
    L(g)·L(s)·P(r)·P(t); L(g)·P(s)·L(r)·P(t); L(g)·P(s)·P(r)·L(t); P(g)·L(s)·L(r)·P(t);
    L(g)·L(s)·P(r)·L(t); P(g)·P(s)·L(r)·L(t); P(g)·P(s)·P(r)·P(t)
  context features:
    P(g)·P(s)·P(r)·P(t)·P(g+1)·P(s+1)·P(t+1); P(g)·P(s)·P(r)·P(t)·P(g-1)·P(s-1)·P(t-1);
    P(g)·P(s)·P(r)·P(t)·P(g+1)·P(s+1); P(g)·P(s)·P(r)·P(t)·P(g-1)·P(s-1);
    P(g)·P(r)·P(t)·P(g+1)·P(r+1)·P(t+1); P(g)·P(r)·P(t)·P(g+1)·P(r-1)·P(t-1);
    P(g)·P(r)·P(g+1)·P(r+1); P(g)·P(r)·P(g-1)·P(r-1);
    P(g)·P(t)·P(g+1)·P(t+1); P(g)·P(t)·P(g-1)·P(t-1);
    P(r)·P(t)·P(r+1)·P(t+1); P(r)·P(t)·P(r-1)·P(t-1)
  backed-off features:
    L(g)·P(r)·P(t); P(g)·L(r)·P(t); P(g)·P(r)·L(t); L(g)·L(r)·P(t);
    L(g)·P(r)·L(t); P(g)·L(r)·L(t); P(g)·P(r)·P(t)

coordination features
    L(g)·P(s); P(g)·P(s); L(g)·L(s)·L(t); L(g)·P(s)·P(t);
    P(g)·L(s); P(g)·L(t); L(g)·P(t); P(g)·P(t);
    P(g)·L(s)·P(t); P(g)·P(s)·L(t); L(s)·P(t); P(s)·L(t); P(s)·P(t);
    L(g)·L(s)·P(t); L(g)·P(s)·L(t); P(g)·L(s)·L(t); P(g)·P(s)·P(t)
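As an illustration of how such templates become concrete features, the sketch below instantiates the uni-gram, bi-gram and in-between templates for a dependency part (s, t). It is a minimal example under assumptions: `L` and `P` are lists of word forms and POS tags indexed by token position, the string encoding of a feature (template name, "=", values joined by "|") is a hypothetical choice, and the context templates are omitted because the text does not say how sentence boundaries are padded.

```python
def dependency_features(s, t, L, P):
    """Instantiate uni-gram, bi-gram and in-between templates for the part (s, t)."""
    templates = [
        # uni-gram templates
        ("L(s)·P(s)", (L[s], P[s])),
        ("L(s)", (L[s],)),
        ("P(s)", (P[s],)),
        ("L(t)·P(t)", (L[t], P[t])),
        ("L(t)", (L[t],)),
        ("P(t)", (P[t],)),
        # bi-gram templates
        ("L(s)·P(s)·L(t)·P(t)", (L[s], P[s], L[t], P[t])),
        ("L(s)·P(s)·P(t)", (L[s], P[s], P[t])),
        ("L(s)·P(s)·L(t)", (L[s], P[s], L[t])),
        ("L(s)·L(t)", (L[s], L[t])),
        ("P(s)·L(t)·P(t)", (P[s], L[t], P[t])),
        ("L(s)·L(t)·P(t)", (L[s], L[t], P[t])),
        ("P(s)·P(t)", (P[s], P[t])),
    ]
    # In-between templates: one lexical and one POS feature per token b between s and t.
    lo, hi = min(s, t), max(s, t)
    for b in range(lo + 1, hi):
        templates.append(("L(s)·L(b)·L(t)", (L[s], L[b], L[t])))
        templates.append(("P(s)·P(b)·P(t)", (P[s], P[b], P[t])))
    return [name + "=" + "|".join(values) for name, values in templates]
```

For the sentence "I saw a dog" with the edge from "saw" to "dog", the lexicalized in-between template fires once, for the intervening word "a".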



5.3 Model Training 

Since the negative log-likelihood $-L(\lambda)$ is a convex function, gradient-based methods can be used to
search for the global minimum. The method of parameter estimation for our models is the limited-memory
BFGS algorithm (L-BFGS) (Nash and Nocedal 1991), with L2 regularization. L-BFGS
is widely used for large-scale optimization, as it combines fast training with a
low memory requirement.
Table 3

UAS, RA and CM of three factored models: Sib for sibling, Gch for grandchild and GSib for grand-sibling.

Eng          L-BFGS                  MIRA                    AP
        UAS    RA     CM       UAS    RA     CM       UAS    RA     CM
Sib     92.4   95.4   46.4     92.5   95.1   45.7     91.9   94.8   44.1
Gch     92.2   94.9   44.6     92.3   94.7   44.0     91.6   94.5   41.6
GSib    93.0   96.1   48.8     93.0   95.8   48.3     92.4   95.5   46.6

Chn          L-BFGS                  MIRA                    AP
        UAS    RA     CM       UAS    RA     CM       UAS    RA     CM
Sib     86.3   78.5   35.0     86.1   77.8   34.1     84.0   74.2   31.1
Gch     85.5   78.0   33.3     85.4   77.6   31.7     83.9   74.9   29.6
GSib    87.2   80.0   37.0     87.0   79.5   35.8     85.1   77.1   32.0

Cze          L-BFGS                  MIRA                    AP
        UAS    RA     CM       UAS    RA     CM       UAS    RA     CM
Sib     85.6   90.8   36.3     85.5   90.5   35.1     84.6   89.5   34.0
Gch     86.0   91.8   36.5     85.8   91.4   35.6     85.0   90.2   34.6
GSib    87.5   93.2   39.3     87.3   92.9   38.4     86.4   92.1   36.9



Meanwhile, L-BFGS can achieve highly competitive performance. Development sets are used
for tuning the hyper-parameter C, which controls the strength of the regularization in the model.
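The objective handed to L-BFGS can be sketched on a toy problem. The code below is an illustration only: it enumerates a small candidate set per sentence instead of computing expectations from marginals with the inside-outside algorithms, and it assumes the regularizer has the form ||λ||² / (2C), which is one common convention and not necessarily the one used in the experiments. The name `objective_and_gradient` and the data layout are hypothetical.

```python
import math

def objective_and_gradient(lam, data, C):
    """Regularized negative conditional log-likelihood and its gradient.

    lam:  list of feature weights.
    data: list of (candidates, gold_index); each candidate is a sparse feature
          dict {feature_index: value} standing in for one parse tree.
    C:    regularization hyper-parameter (assumed form: ||lam||^2 / (2C)).
    """
    value = sum(w * w for w in lam) / (2.0 * C)
    grad = [w / C for w in lam]
    for candidates, gold in data:
        scores = [sum(lam[k] * v for k, v in f.items()) for f in candidates]
        top = max(scores)
        log_z = top + math.log(sum(math.exp(sc - top) for sc in scores))
        value += log_z - scores[gold]
        # Gradient = expected feature counts minus observed feature counts.
        for f, sc in zip(candidates, scores):
            p = math.exp(sc - log_z)
            for k, v in f.items():
                grad[k] += p * v
        for k, v in candidates[gold].items():
            grad[k] -= v
    return value, grad
```

A function of this shape, returning the value together with the gradient, is what a generic L-BFGS routine consumes (for example SciPy's `scipy.optimize.minimize` with `jac=True` and `method="L-BFGS-B"`); in the parsers described here the expected counts would come from the part marginals rather than from enumeration.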

For the purpose of comparison, we also run experiments on graph-based dependency
parsers of the three different factorizations, employing two online learning methods: the
k-best version of the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer 2003;
Crammer et al. 2006; McDonald 2006) with k = 10, and the averaged structured perceptron
(AP) (Freund and Schapire 1999; Collins 2002). Both learning methods have been used
in previous work for training graph-based dependency parsers and achieved highly competitive
parsing accuracies: k-best MIRA is used in McDonald et al. (2005), McDonald and
Pereira (2006), and McDonald and Nivre (2007), and AP is used in Carreras (2007) and Koo and
Collins (2010). Each parser is trained for 10 iterations and selects parameters from the iteration
that achieves the highest parsing performance on the development set.
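For readers unfamiliar with the AP baseline, the following sketch shows the averaged structured perceptron update on a toy problem. It is not the baseline parser: decoding is replaced by an argmax over an enumerated candidate set, the data layout matches no particular toolkit, and `averaged_perceptron` is a hypothetical name. Selection of the best iteration on the development set is left out.

```python
def averaged_perceptron(data, n_feats, epochs=10):
    """Averaged structured perceptron over enumerated candidates.

    data: list of (candidates, gold_index); each candidate is a sparse
          feature dict {feature_index: value} standing in for one parse tree.
    Returns the weight vector averaged over all update steps.
    """
    w = [0.0] * n_feats
    total = [0.0] * n_feats
    steps = 0
    for _ in range(epochs):
        for candidates, gold in data:
            scores = [sum(w[k] * v for k, v in f.items()) for f in candidates]
            pred = max(range(len(candidates)), key=scores.__getitem__)
            if pred != gold:
                # Promote the gold features, demote the predicted ones.
                for k, v in candidates[gold].items():
                    w[k] += v
                for k, v in candidates[pred].items():
                    w[k] -= v
            for k in range(n_feats):
                total[k] += w[k]
            steps += 1
    return [t / steps for t in total]
```

The parameter update is the only place where MIRA would differ: it replaces the fixed-size step with a margin-based step computed from the k best candidates.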

The feature sets were fixed for all three languages. For practical reasons, we exclude
sentences containing more than 100 words from the training sets of all three languages
(Czech, English and Chinese) in all experiments.


5.4 Results and Analysis 

Table 3 shows the results of the three factored parsing models trained by three different
learning algorithms on the three treebanks PTB, CTB and PDT. Our parsing models trained by
L-BFGS achieve significant improvements over the parsing models
trained by AP on all three treebanks, and obtain parsing performance competitive with the
parsing models trained by MIRA. For example, for the third-order grand-sibling model, the
parsers trained by L-BFGS improve UAS by 0.6% for PTB, 2.1% for CTB and 1.1%
for PDT (absolute), compared with the parsers trained by AP. Compared with the parsers trained by MIRA, our parsers
achieve the same UAS for PTB, and higher parsing accuracies (about 0.2% better) for both CTB
Table 4

Training time for three models. #Core refers to the number of cores.

          MIRA       L-BFGS
#Core     1          4          10         18
Sib       33.3h      27.4h      10.9h      6.7h
Gch       160.6h     146.5h     59.8h      22.4h
GSib      300.0h     277.6h     115.7h     72.3h



and PDT. Moreover, it should be noted that our algorithms achieve significant improvements in
RA and CM on all three treebanks over the parsers trained by MIRA, although the parsers trained
by L-BFGS and MIRA exhibit no statistically significant difference in
UAS.
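The UAS differences quoted in this paragraph can be re-derived from the grand-sibling rows of Table 3. The snippet below is only a worked check; the dictionary layout and the function name `uas_gains` are illustrative choices.

```python
# UAS of the third-order grand-sibling model, copied from Table 3.
gsib_uas = {
    "PTB": {"L-BFGS": 93.0, "MIRA": 93.0, "AP": 92.4},
    "CTB": {"L-BFGS": 87.2, "MIRA": 87.0, "AP": 85.1},
    "PDT": {"L-BFGS": 87.5, "MIRA": 87.3, "AP": 86.4},
}

def uas_gains(table, system="L-BFGS"):
    """Absolute UAS difference between `system` and every other learner, per treebank."""
    return {
        treebank: {
            other: round(row[system] - uas, 1)
            for other, uas in row.items() if other != system
        }
        for treebank, row in table.items()
    }

gains = uas_gains(gsib_uas)
```

The resulting differences against AP and MIRA agree with the figures given in the text.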

As mentioned above, parallel computation techniques can be applied to our models to
speed up parser training. Table 4 lists the average training time for our three models with different
numbers of cores. According to this table, the training time of our parsers trained by the off-line
L-BFGS method with 10 or more cores is much less than that of the parsers trained by the online
learning method MIRA. We omit the training time of the online learning method AP, since the
training times for MIRA and AP are nearly the same in our experience. The reason
is that the time for updating parameters, which is the only difference between MIRA and AP,
makes up a very small proportion (less than 10%) of the total training time.
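The size of the speed-up can be read off Table 4 with a short calculation. The snippet assumes the column layout in which MIRA is timed on 1 core and L-BFGS on 4, 10 and 18 cores; that reading of the table, the dictionary layout and the name `speedups` are assumptions of this sketch.

```python
# Training times in hours, copied from Table 4.
hours = {
    "Sib":  {"MIRA": 33.3,  "L-BFGS": {4: 27.4,  10: 10.9,  18: 6.7}},
    "Gch":  {"MIRA": 160.6, "L-BFGS": {4: 146.5, 10: 59.8,  18: 22.4}},
    "GSib": {"MIRA": 300.0, "L-BFGS": {4: 277.6, 10: 115.7, 18: 72.3}},
}

def speedups(table):
    """Ratio of MIRA training time to L-BFGS training time, per model and core count."""
    return {
        model: {cores: round(row["MIRA"] / h, 1) for cores, h in row["L-BFGS"].items()}
        for model, row in table.items()
    }

ratios = speedups(hours)
```

These ratios compare wall-clock times across different core counts, so they measure how much sooner training finishes, not per-core efficiency.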

5.5 Comparison with Previous Works 

Table 5 lists the UAS and CM of related work on PTB, CTB and PDT for comparison.
Our experimental results show an improvement in performance on English and Chinese over
the results of Zhang and Clark (2008), which combines graph-based and transition-based
dependency parsing into a single parser using the framework of beam search, and Zhang and
Nivre (2011), which is based on a transition-based dependency parser with rich non-local
features. For English and Czech, our results match or exceed the results of the two third-order


Table 5

Accuracy comparisons of different dependency parsers on PTB, CTB and PDT.

                                      Eng            Chn            Cze
                                   UAS    CM      UAS    CM      UAS    CM
McDonald et al. (2005)             90.9   36.7    79.7   27.2    84.4   32.2
McDonald and Pereira (2006)        91.5   42.1    82.5   32.6    85.2   35.9
Zhang and Clark (2008)             92.1   45.4    85.7   34.4    -      -
Zhang and Nivre (2011)             92.9   48.0    86.0   36.9    -      -
Koo and Collins (2010), model2     92.9   -       -      -       87.4   -
Koo and Collins (2010), model1     93.0   -       -      -       87.4   -
this paper                         93.0   48.8    87.2   37.0    87.5   39.3
Koo et al. (2008)*                 93.2   -       -      -       87.1   -
Suzuki et al. (2009)*              93.8   -       -      -       88.1   -
Zhang and Clark (2009)*            -      -       86.6   36.1    -      -

graph-based dependency parsers in Koo and Collins (2010). The models marked * cannot be
compared with our work directly, as they exploit large amounts of additional information that
is not used in our models, although our parsers obtain results competitive with these works. For
example, Koo et al. (2008) and Suzuki et al. (2009) make use of unlabeled data, and the parsing
model of Zhang and Clark (2009) utilizes phrase structure annotations.


6. Conclusion 

In this article, we have described probabilistic models for high-order projective dependency
parsing, obtained by relaxing the independence assumption of the previous grammatical bigram
model, and have presented algorithms for computing partition functions and marginals for
three factored parsing models: second-order sibling and grandchild, and third-order
grand-sibling. Our methods achieve competitive or state-of-the-art performance on three treebanks
for English, Chinese and Czech. By analyzing errors on structural properties of length
factors, we have shown that the parsers trained by online and off-line learning methods have
distinctive error distributions despite having very similar overall UAS.
We have also demonstrated that, by exploiting parallel computation techniques, our parsing
models can be trained much faster than parsers using online training methods.